#IT326 Project
This dataset, obtained from vgchartz.com, is a useful resource for exploring the relationship between gaming platforms and genres among the best-selling video games worldwide. It lets us analyze which platforms drive worldwide sales, identify the most successful genres in each region, and track how platform preference and genre popularity change over time.
Source: vgchartz.com
URL link: https://www.kaggle.com/datasets/gregorut/videogamesales
Our goal in studying this dataset is to apply classification and clustering techniques to the input data to predict the popularity of upcoming games.
This dataset has 11 attributes and 16598 objects.
Rank: Ranking of the game based on global sales.
Name: Name of the game.
Platform: Platform the game was released on.
Year: Year the game was released.
Genre: Genre of the game.
Publisher: Publisher of the game.
NA_Sales: Sales of the game in North America.
EU_Sales: Sales of the game in Europe.
JP_Sales: Sales of the game in Japan.
Other_Sales: Sales of the game in other regions.
Global_Sales: Total sales of the game worldwide.
‘Popular’ is our class label: we will use the Global_Sales attribute to predict whether a game will sell 1,000,000 or more copies globally. Because the label is binary (popular or not), our data mining task is classification.
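A minimal sketch of how the class label could be derived, assuming Global_Sales is recorded in millions of units (so 1,000,000 copies corresponds to a value of 1). The data frame `toy`, its values and the `threshold` variable are made up for illustration; only the column name Global_Sales comes from the dataset.

```r
# Toy data frame standing in for the real dataset (values are illustrative only)
toy <- data.frame(
  Name = c("Game A", "Game B", "Game C"),
  Global_Sales = c(2.50, 0.40, 1.00)  # assumed to be in millions of units
)

# Assumption: a game counts as popular when it sold at least 1 million copies
threshold <- 1

# Binary class label stored as a factor so classifiers treat it as categorical
toy$Popular <- factor(ifelse(toy$Global_Sales >= threshold, "Yes", "No"),
                      levels = c("No", "Yes"))

table(toy$Popular)
```

A label like this would have to be derived before the sales columns are rescaled, because the threshold refers to the original sales scale.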
dataset=read.csv("vgsales.csv")
Importing our dataset.
library(outliers)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(Hmisc)
Attaching package: ‘Hmisc’
The following objects are masked from ‘package:dplyr’:
src, summarize
The following objects are masked from ‘package:base’:
format.pval, units
library(ggplot2)
library(cowplot)
library(mlbench)
library(caret)
Loading required package: lattice
library(faux)
************
Welcome to faux. For support and examples visit:
https://debruine.github.io/faux/
- Get and set global package options with: faux_options()
************
library(DataExplorer)
library(randomForest)
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:ggplot2’:
margin
The following object is masked from ‘package:dplyr’:
combine
The following object is masked from ‘package:outliers’:
outlier
Loading the libraries needed for our data mining tasks.
nrow(dataset)
[1] 16598
ncol(dataset)
[1] 11
dim(dataset)
[1] 16598 11
names(dataset)
[1] "Rank" "Name" "Platform" "Year" "Genre"
[6] "Publisher" "NA_Sales" "EU_Sales" "JP_Sales" "Other_Sales"
[11] "Global_Sales"
General information about our dataset, including the number of rows and columns, its dimensions, and the column names.
str(dataset)
'data.frame': 16598 obs. of 11 variables:
$ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : chr "Wii Sports" "Super Mario Bros." "Mario Kart Wii" "Wii Sports Resort" ...
$ Platform : chr "Wii" "NES" "Wii" "Wii" ...
$ Year : chr "2006" "1985" "2008" "2009" ...
$ Genre : chr "Sports" "Platform" "Racing" "Sports" ...
$ Publisher : chr "Nintendo" "Nintendo" "Nintendo" "Nintendo" ...
$ NA_Sales : num 41.5 29.1 15.8 15.8 11.3 ...
$ EU_Sales : num 29.02 3.58 12.88 11.01 8.89 ...
$ JP_Sales : num 3.77 6.81 3.79 3.28 10.22 ...
$ Other_Sales : num 8.46 0.77 3.31 2.96 1 0.58 2.9 2.85 2.26 0.47 ...
$ Global_Sales: num 82.7 40.2 35.8 33 31.4 ...
Dataset structure, including the number of columns and rows and the attribute types.
head(dataset, 10)
Sample of the raw dataset (first 10 rows).
tail(dataset, 10)
Sample of the raw dataset (last 10 rows).
summary(dataset)
Rank Name Platform Year Genre
Min. : 1 Length:16598 Length:16598 Length:16598 Length:16598
1st Qu.: 4151 Class :character Class :character Class :character Class :character
Median : 8300 Mode :character Mode :character Mode :character Mode :character
Mean : 8301
3rd Qu.:12450
Max. :16600
Publisher NA_Sales EU_Sales JP_Sales
Length:16598 Min. : 0.0000 Min. : 0.0000 Min. : 0.00000
Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00000
Mode :character Median : 0.0800 Median : 0.0200 Median : 0.00000
Mean : 0.2647 Mean : 0.1467 Mean : 0.07778
3rd Qu.: 0.2400 3rd Qu.: 0.1100 3rd Qu.: 0.04000
Max. :41.4900 Max. :29.0200 Max. :10.22000
Other_Sales Global_Sales
Min. : 0.00000 Min. : 0.0100
1st Qu.: 0.00000 1st Qu.: 0.0600
Median : 0.01000 Median : 0.1700
Mean : 0.04806 Mean : 0.5374
3rd Qu.: 0.04000 3rd Qu.: 0.4700
Max. :10.57000 Max. :82.7400
Summary of our dataset.
var(dataset$NA_Sales)
[1] 0.6669712
var(dataset$EU_Sales)
[1] 0.2553799
var(dataset$JP_Sales)
[1] 0.0956607
var(dataset$Other_Sales)
[1] 0.03556559
var(dataset$Global_Sales)
[1] 2.418112
Variance of the numeric sales attributes.
# Pie chart of platform shares, drawn from a random sample of 50 games
dataset2 <- dataset %>% sample_n(50)
tab <- dataset2$Platform %>% table()
percentages <- tab %>% prop.table() %>% round(3) * 100
txt <- paste0(names(tab), '\n', percentages, '%')
pie(tab, labels = txt, main = "Pie chart of Platform")
In the pie chart of the Platform attribute, PS is the most common platform in the sample, which suggests that releasing a game for PS users may increase its popularity. The chart is drawn from a random sample of 50 games, so the exact shares change from run to run.
# Bar plot of genre counts, one color per genre, labelled with name and percentage
tab <- dataset$Genre %>% table()
percentages <- tab %>% prop.table() %>% round(3) * 100
txt <- paste0(names(tab), '\n', percentages, '%')
bb <- barplot(tab, axisnames = FALSE, main = "Barplot for Popular genres", ylab = 'count',
              col = c('pink','blue','lightblue','green','lightgreen','red','orange','purple','grey','yellow','azure','olivedrab'))
# Place each label at half the height of its bar
text(bb, tab/2, labels = txt, cex = 1.5)
In terms of genre, Action games are the most common, followed by Sports and Misc games. It is reasonable to assume that so many games exist in these genres because of their popularity and sales.
boxplot(dataset$NA_Sales, main = "BoxPlot for NA_Sales")
boxplot(dataset$EU_Sales, main = "BoxPlot for EU_Sales")
boxplot(dataset$JP_Sales, main = "BoxPlot for JP_Sales")
boxplot(dataset$Other_Sales, main = "BoxPlot for Other_Sales")
The boxplot of the Other_Sales attribute indicates that most values lie close to each other near zero (median 0.01, third quartile 0.04), with many outliers reaching up to 10.57, because a small number of games sell far more than the rest.
boxplot(dataset$Global_Sales , main="BoxPlot for Global_Sales")
The boxplot of the Global_Sales attribute indicates that most values lie close to each other (median 0.17, third quartile 0.47), with many outliers reaching up to 82.74, because a small number of games sell far more than the rest.
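The points a boxplot draws as outliers can also be listed and counted directly. The sketch below uses base R's `boxplot.stats()` on a made-up vector (`sales` is illustrative, not taken from the dataset); the same call could be applied to any of the sales columns.

```r
# Illustrative sales values (made up); the last one is far from the rest
sales <- c(0.01, 0.02, 0.02, 0.04, 0.05, 0.06, 0.08, 0.11, 3.50)

# boxplot.stats() returns the points a boxplot would draw as outliers,
# i.e. values more than 1.5 * IQR beyond the hinges
flagged <- boxplot.stats(sales)$out
flagged

# Share of observations flagged as outliers
length(flagged) / length(sales)
```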
ggplot(dataset, aes(x = Global_Sales, y = Genre)) + geom_boxplot(fill = "yellow", width = 0.5) + ggtitle("BoxPlots for genre and Global_Sales")
dataset$Year %>% table() %>% barplot( main = "Barplot for year")
pairs(~NA_Sales + EU_Sales + JP_Sales + Other_Sales + Global_Sales, data = dataset,
main = "Sales Scatterplot")
We used a scatterplot matrix to determine the type of correlation between the sales attributes; most pairs are positively correlated with each other.
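The visual impression from a scatterplot matrix can be backed by numbers with `cor()`. This is a sketch on a made-up data frame (`toy` and its values are illustrative only); on the real data the same call would be applied to the sales columns.

```r
# Toy regional sales (values are made up for illustration)
toy <- data.frame(
  NA_Sales = c(0.1, 0.5, 1.2, 3.0, 4.1),
  EU_Sales = c(0.05, 0.3, 0.9, 2.2, 2.0),
  JP_Sales = c(0.3, 0.0, 0.1, 0.4, 0.2)
)
# Global sales taken as the sum of the regions in this toy example
toy$Global_Sales <- rowSums(toy)

# Pearson correlation between every pair of columns, rounded for readability
corr <- round(cor(toy), 2)
corr
```

Values close to 1 indicate a strong positive linear relationship, values near 0 a weak one.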
dataset$Rank=as.character(dataset$Rank)
Rank should be character rather than numeric, because we will use it as ordinal data.
sum(is.na(dataset$Rank))
[1] 0
NullRank<-dataset[dataset$Rank=="N/A",]
NullRank
Checking for nulls in Rank (there are no nulls).
sum(is.na(dataset$Name))
NullName<-dataset[dataset$Name=="N/A",]
NullName
Checking for nulls in Name (there are no nulls).
sum(is.na(dataset$Platform))
[1] 0
NullPlatform<-dataset[dataset$Platform=="N/A",]
Checking for nulls in Platform (there are no nulls).
sum(is.na(dataset$Year))
[1] 0
NullYear<-dataset[dataset$Year=="N/A",]
NullYear
Checking for nulls in Year. Missing years are stored as the string "N/A", so is.na() does not count them. We will not delete these rows; we keep "N/A" as a global constant because we still want their sales data.
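The difference between a real missing value and an "N/A" placeholder string can be seen in a small sketch. The vector `year` below is made up for illustration and only mimics how a text column with placeholders looks after import.

```r
# Illustrative Year column: text values with "N/A" placeholders (made up)
year <- c("2006", "N/A", "1985", "N/A", "2010")

sum(is.na(year))     # 0: "N/A" is an ordinary string, not a missing value
sum(year == "N/A")   # 2: the placeholders are found by string comparison

# Optionally convert the placeholders to real NA values
year_num <- as.integer(ifelse(year == "N/A", NA, year))
sum(is.na(year_num))
```

If real NA values are preferred from the start, `read.csv()` accepts an `na.strings` argument (for example `na.strings = "N/A"`) that performs this conversion at import time.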
sum(is.na(dataset$Other_Sales))
[1] 0
NullOther_Sales<-dataset[dataset$Other_Sales=="N/A",]
There are no null values in Other_Sales.
sum(is.na(dataset$Genre))
[1] 0
NullGenre<-dataset[dataset$Genre=="N/A",]
NullGenre
Checking for nulls in Genre (there are no nulls).
sum(is.na(dataset$Publisher))
[1] 0
NullPublisher<-dataset[dataset$Publisher=="N/A",]
NullPublisher
Checking for nulls in Publisher. We will not delete the rows with a missing publisher; we keep "N/A" as a global constant because we still want their sales data.
sum(is.na(dataset$Global_Sales))
[1] 0
NullGlobal_Sales<-dataset[dataset$Global_Sales=="N/A",]
There are no null values in Global_Sales.
dataset$Platform=factor(dataset$Platform,levels=c("2600","3DO","3DS","DC","DS","GB","GBA","GC","GEN","GG","N64","NES","NG","PC","PCFX","PS","PS2","PS3","PS4","PSP","PSV","SAT","SCD","SNES","TG16","Wii","WiiU","WS","X360","XB","XOne"), labels=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31))
Since most machine learning algorithms work with numbers rather than text or categorical variables, the Platform column is encoded with numeric labels.
dataset$Genre=factor(dataset$Genre,levels=c("Action","Adventure","Fighting","Platform","Puzzle","Racing","Role-Playing","Shooter","Simulation","Sports","Strategy","Misc"),labels=c(1,2,3,4,5,6,7,8,9,10,11,12))
Since most machine learning algorithms work with numbers rather than text or categorical variables, the Genre column is encoded with numeric labels.
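A small sketch of what this kind of encoding produces, using a made-up vector (`genre` and the three levels below are illustrative). It shows that `factor()` with numeric-looking labels still returns a factor, and how to obtain plain integer codes if an algorithm needs them.

```r
# Illustrative categorical values (made up)
genre <- c("Sports", "Action", "Racing", "Action")

# as.integer() on a factor returns the position of each value in `levels`
genre_factor <- factor(genre, levels = c("Action", "Racing", "Sports"))
genre_code <- as.integer(genre_factor)
genre_code

# With numeric-looking labels the result is still a factor, not a number
labelled <- factor(genre, levels = c("Action", "Racing", "Sports"), labels = c(1, 2, 3))
class(labelled)

# Converting through character recovers the label values as integers
as.integer(as.character(labelled))
```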
randomForest was loaded after outliers and masks its outlier() function, so the calls below name the outliers package explicitly.
Outlier of NA_Sales
OutNA_Sales = outliers::outlier(dataset$NA_Sales, logical = TRUE)
Outlier of EU_Sales
OutEU_Sales = outliers::outlier(dataset$EU_Sales, logical = TRUE)
Outlier of JP_Sales
OutJP_Sales = outliers::outlier(dataset$JP_Sales, logical = TRUE)
Outlier of Other_Sales
OutOS = outliers::outlier(dataset$Other_Sales, logical = TRUE)
Outlier of Global_Sales
OutGS = outliers::outlier(dataset$Global_Sales, logical = TRUE)
# Find_outlier was never defined in the original run; as an assumption, it is taken to be the rows flagged in any sales column
Find_outlier = which(OutNA_Sales | OutEU_Sales | OutJP_Sales | OutOS | OutGS)
dataset = dataset[-Find_outlier,]
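The outliers package documents `outlier()` as finding the value with the largest difference from the sample mean, so each call flags only the most extreme value of a column. The base-R sketch below mimics that rule on a made-up vector; `x`, `dev` and `is_extreme` are hypothetical names, and this is an approximation of the idea rather than the package's own code.

```r
# Illustrative values (made up); 9.0 is far from the others
x <- c(0.2, 0.3, 0.25, 0.4, 9.0)

# Distance of every value from the sample mean
dev <- abs(x - mean(x))

# Logical flag for the value(s) farthest from the mean
is_extreme <- dev == max(dev)

# Keep everything except the flagged value
x[!is_extreme]
```

Filtering with a logical vector, as in the last line, avoids the pitfall of negative indexing with an empty index, which would drop every row.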
datsetWithoutNormalization<-dataset
Keeping a copy of the dataset before normalization.
normalize <- function(x) {return ((x - min(x)) / (max(x) - min(x)))}
dataset$NA_Sales<-normalize(datsetWithoutNormalization$NA_Sales)
dataset$EU_Sales<-normalize(datsetWithoutNormalization$EU_Sales)
dataset$JP_Sales<-normalize(datsetWithoutNormalization$JP_Sales)
dataset$Other_Sales<-normalize(datsetWithoutNormalization$Other_Sales)
dataset$Global_Sales<-normalize(datsetWithoutNormalization$Global_Sales)
We use min-max normalization, which rescales each sales attribute to the range [0, 1]; it is better for visualization.
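A worked sketch of min-max scaling on a made-up vector, to show the effect of the formula (x - min) / (max - min). The function name `min_max` and the values in `sales` are illustrative only.

```r
# Min-max scaling: the smallest value maps to 0 and the largest to 1
min_max <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Illustrative sales values (made up)
sales <- c(0.01, 0.47, 2.00, 82.74)

scaled <- min_max(sales)
scaled
range(scaled)
```

The ordering of the values is preserved, but a single very large value compresses all the others toward 0. A column whose values are all equal would cause a division by zero and needs separate handling.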
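Our class label (Popular) is derived from Global_Sales. The regional sales attributes are scored by their importance to the Global_Sales column, and those that are less important are deleted from the dataset. The score comes from caret's filterVarImp(), which uses the area under the ROC curve when the outcome is a factor; with a numeric outcome such as Global_Sales it instead scores how well each predictor fits the outcome.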
roc_imp <- filterVarImp(x = dataset[,7:10], y = dataset$Global_Sales)
Sorting the scores in decreasing order.
roc_imp <- data.frame(cbind(variable = rownames(roc_imp), score = roc_imp[,1]))
roc_imp$score <- as.double(roc_imp$score)
roc_imp[order(roc_imp$score,decreasing = TRUE),]
We will remove JP_Sales because it is of low importance to our class label (Global_Sales).
dataset<- dataset[,-9]
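Dropping a column by position depends on the column order staying the same. As an alternative sketch, the column can be dropped by name; the data frame `toy` below is made up for illustration and only reuses the sales column names.

```r
# Toy data frame with the sales column names (values are made up)
toy <- data.frame(
  NA_Sales = c(0.5, 1.2),
  EU_Sales = c(0.3, 0.9),
  JP_Sales = c(0.0, 0.1),
  Other_Sales = c(0.1, 0.2),
  Global_Sales = c(0.9, 2.4)
)

# Keep every column except the one named JP_Sales
toy <- toy[, names(toy) != "JP_Sales"]
names(toy)
```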
#Dataset after pre-processing